Classifying Very High-Dimensional Data with Random Forests Built from Small Subspaces
Abstract
The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features into each subspace is not suitable for high dimensional data consisting of thousands of features, because such data often contain many features which are uninformative to classification, and random sampling often fails to include informative features in the selected subspaces. Consequently, the classification performance of the random forest model is significantly affected. In this paper, the authors propose an improved random forest method which uses a novel feature weighting method for subspace selection and therefore enhances classification performance on high dimensional data. A series of experiments on 9 real-life high dimensional datasets demonstrated that, using a subspace size of log2(M) + 1 features, where M is the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.

DOI: 10.4018/jdwm.2012040103

1. INTRODUCTION

One characteristic of high dimensional data is that classes of objects are present in subspaces of the data dimensions. For example, in text data, documents relating to sport are categorized by the key words describing sport, while documents relating to music are represented by the key words describing music. The other characteristic is that a large number of features are uninformative to the class feature. That is, many features are only weakly correlated with the class feature, if at all, and have low power in predicting object classes (Saxena & Wang, 2010).

The random forest algorithm (Breiman, 2001) is a popular classification method used to build ensemble models of decision trees from subspaces of data. Experimental results have shown that random forest models can achieve high accuracy in classifying high dimensional data (Banfield et al., 2007). Interest in random forests has grown in many domains where high dimensional data is prominent, including bioinformatics (Pang et al., 2006; Diaz-Uriarte & De Andres, 2006; Chen & Liu, 2005; Bureau et al., 2005), medical data mining (Ward et al., 2006) and image classification (Bosch, Zisserman, & Muoz, 2007).

Several methods have been proposed to build random forest models from subspaces of data (Breiman, 2001; Ho, 1995, 1998; Dietterich, 2000). Among them, Breiman's method (Breiman, 2001) has been popular due to its good performance compared to other methods (Banfield, Hall, Bowyer, & Kegelmeyer, 2007). Breiman uses simple random sampling from all the available features to select subspaces when growing the unpruned trees within the random forest model. Breiman suggested selecting log2(M) + 1 features for a subspace, where M is the total number of independent features in the data. This works well for data of moderate dimensionality (e.g., fewer than 100 features) but is not suitable for very high dimensional data consisting of thousands of features. Such data are dominated by uninformative features which have very low predictive power with respect to the target classification, so Breiman's subspace size of log2(M) + 1 is too small: simple random sampling frequently produces subspaces containing no informative features at all (Amaratunga, Cabrera, & Lee, 2008). As a result, weak trees are created and the classification performance of the random forest is significantly affected. To increase the chance of selecting informative features in subspaces, the subspace size has to be enlarged considerably. However, this increases the computational requirements of the algorithm and increases the likelihood of the resulting trees being correlated, and correlated trees reduce the classification performance of a random forest model (Breiman, 2001).
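To make the scale of this problem concrete, the following minimal sketch computes Breiman's subspace size and the probability that a simple random subspace of that size contains no informative feature. The dataset sizes used are hypothetical assumptions chosen to mirror the "thousands of features, few informative" setting described above, not figures from the paper.

```python
import math

# Hypothetical illustration: M features in total, of which only m_info
# are informative. These numbers are assumptions, not results from the paper.
M = 10000          # total number of features
m_info = 100       # informative features (1% of the total)

k = int(math.log2(M)) + 1   # Breiman's suggested subspace size: 14 for M = 10000

# Probability that none of the k sampled features is informative
# (simple random sampling without replacement from the M features).
p_none = math.prod((M - m_info - i) / (M - i) for i in range(k))
print(f"subspace size k = {k}, P(no informative feature) = {p_none:.2f}")
```

Under these assumed numbers, most randomly sampled subspaces would contain no informative feature at all, which is the situation that motivates the weighted sampling introduced next.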
To address this problem, Amaratunga, Cabrera, and Lee (2008) proposed a feature weighted method for subspace sampling. The weight of a feature is computed with respect to the correlation between the feature and the class. The weights are treated as the probabilities with which features are selected for inclusion in a subspace. Using this feature weighted method to sample subspaces, there is a high chance that informative features are selected when growing trees for a random forest model. The method can be compared to AdaBoost (Freund & Schapire, 1996; Qiu, Wang, & Bi, 2008), which selects training samples according to sample weights computed from the result of the previous classification. Weighted sampling increases the probability of selecting informative features for inclusion in each subspace. This increases the average strength of the trees making up the random forest model, and thus reduces the generalization error bound. Consequently, the classification performance of the random forest model is increased. However, Amaratunga's method is only valid for two-class problems, because it uses the two-sample t-test to calculate the feature weights.

In this paper, we propose a feature weighting method for subspace selection that handles multi-class problems. Instead of the t-test, we calculate the chi-square statistic or the information gain ratio as the feature weight. Both measures capture the correlation between a feature and the class for multi-class problems. The larger the weight, the more informative the feature is to the classification. We normalize the set of weights to sum to 1 and treat the normalized weights as the probabilities of selecting features for a subspace. We then use weighted sampling to randomly select feature subspaces with a high chance of including informative features. In this way, we increase the classification accuracy of the random forest model without the need to increase the subspace size beyond Breiman's log2(M) + 1 (where M is the total number of features in the data).
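The following is a minimal sketch of this weighted subspace sampling, assuming the per-feature weights (chi-square statistics or information gain ratios) have already been computed. The function and variable names are illustrative, not taken from the paper.

```python
import math
import numpy as np

def select_subspace(weights, rng):
    """Select log2(M) + 1 features by weighted sampling without replacement.

    weights : 1-D array of non-negative feature weights (e.g., chi-square
              statistics or information gain ratios), one per feature.
    rng     : numpy random Generator.
    """
    M = len(weights)
    k = int(math.log2(M)) + 1            # Breiman's subspace size
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                      # normalize weights to probabilities
    # Weighted sampling without replacement: informative (high-weight)
    # features have a higher chance of entering the subspace.
    return rng.choice(M, size=k, replace=False, p=p)

# Illustrative usage, with random weights standing in for real scores.
rng = np.random.default_rng(0)
weights = rng.random(10000)
subspace = select_subspace(weights, rng)
print(subspace)
```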
Experiments were performed on 9 real-life high dimensional datasets with dimensions ranging from 780 to 13195 and with the number of classes ranging from 2 to 25. The results demonstrate that, with a subspace size of log2(M) + 1, the random forest using feature weighting for subspace selection significantly outperforms random forest models built with simple random sampling. Classification accuracy is increased by 19% on average, with a maximum increase of 56%. A statistical analysis of the distributions of informative features across all subspaces of the random forest models was also conducted. The results reveal that the proportion of informative features in the subspaces selected with the feature weighting method was much higher than in the subspaces selected with simple random sampling. This explains the effectiveness of the new subspace selection method in improving the classification performance of random forest models on high dimensional data.

Compared to using the t-test for feature weighting, the proposed method can be used to build random forest models from high dimensional data with multiple classes, generalizing Amaratunga's method and increasing its applicability. Our new feature weighting method for subspace selection can also build more accurate random forest models than the original method proposed by Breiman without the need to increase the size of the subspaces.

The idea of weighting features with the chi-square statistic was first used to detect search interfaces in hidden web pages with random forests in Ye et al. (2008). In that application, the form data of web pages represented in HTML was extracted with a web crawler. The web pages were classified into two classes, one containing a search window form and the other without. Features describing the forms in the pages were extracted from the HTML data. The chi-square statistic was computed to measure the correlation between a feature and the class label and used as the weight of the feature. To generate multiple trees for the random forest, the training data was sampled with replacement to create multiple data subsets. From each sampled dataset, a subset of features was randomly selected with respect to the feature weights; the larger the weight of a feature, the more likely the feature was to be selected. An existing decision tree algorithm was then used to generate a random forest from the sampled datasets. The results on small datasets with a couple of hundred features showed no obvious advantage of this feature weighting method in classification accuracy.

Further investigating the idea of Ye et al. (2008), in this paper we study a random forest algorithm with a feature weighting method for subspace selection in classifying very high dimensional data with thousands of features. In the new random forest algorithm, we calculate the feature weights and use weighted sampling to randomly select features for the subspace at each node when building the individual trees. This feature weighted subspace selection method increases the randomness in growing the individual trees and the diversity of the component trees. Experimental results show that the effectiveness of the random forest model built with our new algorithm is evident in classifying very high dimensional data.

This paper is organized as follows. In Section 2, we give a brief analysis of random forests on high dimensional data. In Section 3, we present the feature weighting method for subspace selection and give a new random forest algorithm. Section 4 summarizes four measures used to evaluate random forest models. We present experimental results on 9 real-life high dimensional datasets in Section 5. Section 6 contains our conclusions.

2. RANDOM FORESTS FOR HIGH DIMENSIONAL DATA

Random forests are suitable for the classification of large high dimensional data. The following advantages can be identified in comparison with other classification algorithms.

1. As an ensemble model in which each component classifier is built from a subspace of the data, it is capable of modeling classes in subspaces.
2. Large datasets can be handled efficiently because decision tree induction is used to build the component classifiers.

3. High dimensional data is handled well in multi-class tasks, such as classifying text data with many categories.

4. The component classifiers within the ensemble can be built in parallel in a distributed environment, significantly reducing the time needed to create a random forest model from large data.

There are two general methods for selecting subspaces of features when growing decision trees for random forest models. The first method, proposed by Ho (1998), is to randomly sample a subset of features from the entire feature set. The sampled training data for the decision tree contains only the selected features, and the decision tree considers only these features as candidates for splitting nodes. With this method an existing decision tree algorithm can be used directly to build the individual component trees without modification. The second method, proposed by Breiman (2001), is to sample both objects and features from the entire training data. To create the training data for building a component decision tree, we first randomly sample the objects from the full training dataset, often by sampling with replacement. To grow a tree from the sampled data, at each node we randomly sample the features to be used as the candidate features for splitting that specific node. This double sampling of objects and features increases the randomness in growing the individual trees and the diversity of the component trees. However, it requires a modification to an existing decision tree implementation to include the subspace sampling function at each node of the tree.
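The double sampling scheme just described might look like the following minimal sketch. The assumptions here are illustrative and not from the paper: numeric features, binary threshold splits scored by Gini impurity, and hypothetical function names. The paper's proposed variant would replace the uniform per-node sampling with the weighted sampling shown earlier.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def grow_tree(X, y, depth=0, max_depth=10):
    """Grow one unpruned tree, drawing a fresh random feature subspace at each node."""
    if depth >= max_depth or len(np.unique(y)) == 1:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[counts.argmax()]}

    M = X.shape[1]
    k = int(math.log2(M)) + 1                        # subspace size log2(M) + 1
    subspace = rng.choice(M, size=k, replace=False)  # per-node feature subspace

    best = None
    for f in subspace:                               # consider only subspace features
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            if left.all() or not left.any():
                continue
            score = (left.mean() * gini(y[left]) +
                     (~left).mean() * gini(y[~left]))
            if best is None or score < best[0]:
                best = (score, f, t, left)

    if best is None:                                 # no usable split found
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[counts.argmax()]}

    _, f, t, left = best
    return {"feature": f, "threshold": t,
            "left": grow_tree(X[left], y[left], depth + 1, max_depth),
            "right": grow_tree(X[~left], y[~left], depth + 1, max_depth)}

def random_forest(X, y, n_trees=10):
    """Double sampling: bootstrap the objects, then per-node feature subspaces."""
    forest = []
    n = len(y)
    for _ in range(n_trees):
        idx = rng.choice(n, size=n, replace=True)    # bootstrap sample of objects
        forest.append(grow_tree(X[idx], y[idx]))
    return forest
```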
The size of the feature subspaces affects both the efficiency of building the random forest and the performance of the resulting model. Ho's subspace approach uses half of the features in the dataset, while Breiman suggests selecting log2(M) + 1 features for a subspace, where M is the number of independent features in the training dataset. Both sizes work well on data with a small number of features, where small might mean fewer than 100 features. However, both become problematic when the number of features reaches the hundreds or thousands, and such data are no longer rare in many application domains. When presented with very high dimensional data we often find that very many of the features are uninformative and the percentage of truly informative features is small. In such a circumstance, Ho's subspace size of half the number of features is too large: a considerable computational cost is incurred and the resulting decision trees will be highly correlated. In contrast, Breiman's subspace size of log2(M) + 1 is too small: with simple random sampling, selecting this few features will invariably result in few, and quite likely no, informative features being included in the subspace, producing many weak trees. According to Breiman's generalization error bound indicator, increasing the correlation between trees or decreasing the strength of the component trees increases the generalization error bound of random forests. Using simple random sampling for very high dimensional data therefore degrades the performance of the individual decision trees (Amaratunga, Cabrera, & Lee, 2008), because almost all subspaces are likely to consist mostly (or completely) of uninformative features. To build decision trees with improved performance, it is important to select subspaces containing more informative features.

Amaratunga, Cabrera, and Lee (2008) introduced a feature weighted method for subspace sampling in place of simple random sampling. In this method, a two-sample t-test between a feature and the class is used to score each feature, so that informative features receive high scores. The score is then used as the weight of the feature and treated as the probability of the feature being selected. With this feature weighted sampling method, informative features have a higher chance of being selected into subspaces. However, many practical problems are multi-class, and Amaratunga's method can only be applied to data with two classes because of its use of the two-sample t-test to score features. To extend the concept to multi-class problems, we propose a new feature weighting method leading to a new random forest algorithm.

3. FEATURE WEIGHTING FOR SUBSPACE SELECTION

In this section we present the feature weighting method for subspace selection in random forests. We discuss the methods used to calculate the feature weights from training data, and introduce an algorithm that uses the feature weighting method to sample subspaces in building a random forest. The algorithm is an extension of Breiman's classical random forest algorithm for classification models.

3.1. Notation

Let Y be the class (the target feature) with q distinct class labels y_j for j = 1, ..., q. For the purposes of our discussion we consider a single categorical feature A in dataset D with p distinct categorical values. We denote the distinct values by a_i for i = 1, ..., p. Numeric features are discretized into p intervals with a supervised discretization method (Quinlan, 1996; Engle & Gangopadhyay, 2010). Assume D has val objects in total. The size of the subset of D satisfying the condition that A = a_i and Y = y_j is denoted val_ij. Considering all combinations of the categorical values of A and the labels of Y, we can obtain a contingency table (Pearson, 1904) of A against Y as shown in Table 1. The far right column contains the marginal totals for feature A: for each value a_i, the marginal total is val_i1 + val_i2 + ... + val_iq.
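As a concrete illustration of a chi-square feature weight computed from the contingency table of A against Y, the following is a minimal sketch in the notation above, using the standard Pearson chi-square computation from observed and expected cell counts. The function name and the example table are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def chi_square_weight(table):
    """Pearson chi-square statistic for a p x q contingency table.

    table[i, j] = val_ij, the number of objects with A = a_i and Y = y_j,
    following the notation of Section 3.1. Larger values indicate a
    stronger association between feature A and the class Y.
    """
    table = np.asarray(table, dtype=float)
    total = table.sum()                              # val, the number of objects
    row_totals = table.sum(axis=1, keepdims=True)    # marginal totals of A
    col_totals = table.sum(axis=0, keepdims=True)    # marginal totals of Y
    expected = row_totals @ col_totals / total       # expected counts under independence
    return ((table - expected) ** 2 / expected).sum()

# Illustrative usage: a hypothetical 3-value feature against a 2-class target.
example = [[30, 5],
           [10, 25],
           [ 8, 22]]
print(chi_square_weight(example))
```

In the full method, such per-feature weights would then be normalized to sum to 1 and used as the sampling probabilities in the weighted subspace selection sketched earlier.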
Similar Articles
Hybrid weighted random forests for classifying very high-dimensional data
Random forests are a popular classification method based on an ensemble of a single type of decision trees from subspaces of data. In the literature, there are many different types of decision tree algorithms, including C4.5, CART, and CHAID. Each type of decision tree algorithm may capture different information and structure. This paper proposes a hybrid weighted random forest algorithm, simul...
Classifying Very-High-Dimensional Data with Random Forests of Oblique Decision Trees
The random forests method is one of the most successful ensemble methods. However, random forests do not have high performance when dealing with very-high-dimensional data in presence of dependencies. In this case one can expect that there exist many combinations between the variables and unfortunately the usual random forests method does not effectively exploit this situation. We here investig...
Stratified sampling for feature subspace selection in random forests for high dimensional data
For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for rand...
Adaptively Discovering Meaningful Patterns in High-Dimensional Nearest Neighbor Search
To query high-dimensional databases, similarity search (or k nearest neighbor search) is the most extensively used method. However, since each attribute of high dimensional data records only contains very small amount of information, the distance of two high-dimensional records may not always correctly reflect their similarity. So, a multi-dimensional query may have a k-nearest-neighbor set whi...
Random Forests with Missing Values in the Covariates
In Random Forests [2] several trees are constructed from bootstrapor subsamples of the original data. Random Forests have become very popular, e.g., in the fields of genetics and bioinformatics, because they can deal with high-dimensional problems including complex interaction effects. Conditional Inference Forests [8] provide an implementation of Random Forests with unbiased variable selection...